BA project

Authors: Dayneth Calderon Flores, Ivana; Jaouad, Dorra; Krafft, Mathilde

Affiliation: Université de Lausanne

1 The Financial and Critical Impact of Female Representation in Films: A Study of Bechdel Test Outcomes

1.1 Background and Motivation

The Bechdel Test serves as a widely recognized measure of female representation in films, highlighting important discussions surrounding diversity and gender equality within the film industry. This topic is significant as it not only addresses the representation of women but also engages with broader societal issues such as gender stereotypes and equity.

The intersection of cultural metrics, exemplified by the Bechdel Test, and economic variables, including a film’s budget and revenue, renders this research particularly compelling. By exploring how diversity impacts a film’s commercial and critical success, the study examines two key dimensions: cultural representation and its potential economic and critical consequences.

This research aligns with the overarching goal of leveraging data-driven insights to comprehend the effects of cultural dynamics on creative industries. Through a combination of statistical, predictive, and exploratory analyses, the project aims to address questions of both social significance and commercial relevance.

1.2 Project Objectives

1.2.1 Main Objective

The primary objective of this project is to investigate how the Bechdel Test result, together with other film-related factors such as budget, genre, and director, relates to a film’s revenue. The specific aims include:

  • Analyzing whether passing the Bechdel Test has a statistically significant effect on domestic gross or overall revenue.

  • Assessing if films that pass the Bechdel Test tend to win more awards or achieve higher IMDb ratings.

1.2.2 Sub-goals

  1. Predictive Modeling: Developing two predictive models. The first estimates a film’s success (revenue, awards, and IMDb ratings) from various factors, including the Bechdel Test result. The second estimates the probability of a film passing the Bechdel Test based on attributes such as budget, genre, and director.

  2. Correlation Exploration: Investigating correlations between a film’s performance on the Bechdel Test and its revenue, as well as its awards and IMDb ratings, to uncover potential patterns and insights.

  3. Trend Visualization: Visualizing trends between films that either pass or fail the Bechdel Test and their associated economic data, including revenues, budgets, awards won, and IMDb ratings. This visualization will help elucidate the relationship between female representation and film success.

1.3 Research Questions

In this section, we pose the following research questions to explore the potential influence of female representation in films on commercial and critical success. By examining correlations between Bechdel Test results and metrics like revenue, awards, and IMDb ratings, we aim to identify patterns that link gender diversity with performance outcomes. Additionally, we explore predictive models to assess whether factors such as budget, genre, and director can reliably estimate both a film’s success and its likelihood of passing the Bechdel Test. These questions guide our analysis to uncover data-driven insights into the economic and critical impact of gender representation in the film industry.

  1. Is there a significant correlation between passing the Bechdel Test and a film’s revenue?
  2. Do films that pass the Bechdel Test tend to win more awards or receive higher IMDb ratings?
  3. Can we build a predictive model to estimate a film’s success (revenue, awards, IMDb ratings) based on the Bechdel Test results and other factors like budget, genre, and director?
  4. Can we develop a predictive model to determine whether a film will pass the Bechdel Test based on factors such as budget, genre, and director?

1.4 Data Collection

In this section, we outline the process of data collection and merging. The data sources for this project were obtained from data.world, which provides access to the following datasets:

  • Dataset df2 (CSV file, keyed by imdb): It contains the Bechdel test outcomes for various films, together with their budget and revenue figures. The information was gathered from BechdelTest.com and The-Numbers.com.

  • Dataset df (JSON file, keyed by imdbID): It includes additional film-related data, such as genre, director, awards, and IMDb ratings. This information comes from the IMDb dataset.

The two datasets were successfully merged into a single primary data frame using the IMDb ID as the key. This process resulted in a consolidated dataset comprising 1,792 rows. Specific columns of interest were selected to facilitate further analysis and ensure the relevance of the data to our research objectives.

Show the code
# Load jsonlite, which provides fromJSON()
library(jsonlite)

# Read the JSON file (keyed by imdbID)
df <- fromJSON("https://query.data.world/s/mel73qzc35dvjcs54h4x4tatmvveaj?dws=00000")

# Read the CSV file (keyed by imdb)
df2 <- read.csv("https://query.data.world/s/zosxjqvjiygclw2wix74pcp4vyesvb?dws=00000", header = TRUE, stringsAsFactors = FALSE)

# Merge the two datasets on the IMDb ID (inner join: only films present in both)
merged_data <- merge(df2, df, by.x = "imdb", by.y = "imdbID", all = FALSE)

# Remove unwanted columns
df_final <- subset(merged_data, select = -c(domgross_2013., Writer, Year, Actors, imdb, title, test, clean_test, domgross, intgross, Plot, code, period.code, decade.code, budget, Rated, Response, Metascore, Released, Runtime, Type, Poster, Error))

1.5 Data Cleaning

In the data cleaning section, we ensure the dataset is prepared for effective analysis by addressing any inconsistencies, missing values, and irrelevant entries. This process involves examining each variable to standardize formats, handle null values, and remove duplicates, ultimately enhancing the accuracy and reliability of the Exploratory Data Analysis and modeling phases. By refining the dataset, we lay a strong foundation for meaningful insights and trustworthy results in subsequent analyses.

  1. Standardizing Column Names
    This section focuses on renaming columns to create a consistent and clear naming convention across the dataset, ensuring easier readability and usability for analysis.
Show the code
# Load dplyr for the pipe and rename()
library(dplyr)

# Rename columns to a consistent naming convention
df_final <- df_final %>%
  rename(
    Year = year,
    Bechdel_test_result = binary,
    Budget = budget_2013.,
    Revenue = intgross_2013.,
    Movie_Title = Title,
    Country_of_Origin = Country,
    Imdb_Rating = imdbRating,
    Imdb_Votes = imdbVotes
  )
  2. Isolating Individual Values by Language, Genre, and Director
    Here, we separated entries with multiple values in the Language, Genre, and Director columns, creating distinct rows for each value. This transformation enables more granular analysis by ensuring each entry corresponds to a single language, genre, or director.
Show the code
# Load tidyr for separate_rows()
library(tidyr)

# Separate the languages: one row per language
df_final <- df_final %>%
  separate_rows(Language, sep = ", ")

# Separate the genres: one row per genre
df_final <- df_final %>%
  separate_rows(Genre, sep = ", ")

# Separate the directors: one row per director
df_final <- df_final %>%
  separate_rows(Director, sep = ", ")
  3. Extracting and Summarizing Award Wins and Nominations
    In this section, we parsed the Awards column to isolate and categorize Oscar and other award wins and nominations. By extracting relevant numerical values, we created separate columns for Wins and Nominations, enabling a more detailed analysis of award performance. Intermediate columns were removed to streamline the dataset.
Show the code
# Extract Wins and Nominations from the free-text Awards column.
# str_extract() is vectorized, so rowwise() is not needed.
library(stringr)

df_final <- df_final %>%
  mutate(
    # Headline wins: the number right after "Won " (e.g. "Won 2 Oscars.")
    oscar_wins = replace_na(
      str_extract(Awards, "(?i)\\bWon (\\d+)") %>%
        str_remove("(?i)Won ") %>%
        as.numeric(),
      0),

    # Other wins: the number right before " win" or " wins"
    other_wins = replace_na(
      str_extract(Awards, "(?i)\\d+(?= wins?)") %>%
        as.numeric(),
      0),

    Wins = oscar_wins + other_wins,

    # Headline nominations: the number right after "Nominated for "
    oscar_nominations = replace_na(
      str_extract(Awards, "(?i)\\bNominated for (\\d+)") %>%
        str_remove("(?i)Nominated for ") %>%
        as.numeric(),
      0),

    # Other nominations: the number right before " nomination" or " nominations"
    other_nominations = replace_na(
      str_extract(Awards, "(?i)\\d+(?= nominations?)") %>%
        as.numeric(),
      0),

    Nominations = oscar_nominations + other_nominations
  ) %>%
  select(-Awards, -oscar_wins, -other_wins, -oscar_nominations, -other_nominations)  # Remove intermediate columns
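
To make the extraction logic concrete, the following minimal sketch applies the same idea to a few sample strings using base R only. The sample strings and the helper `extract_count()` are illustrative assumptions (they mimic a typical IMDb-style award summary) and are not part of the project's code.

```r
# Hypothetical helper: returns the first number captured by `pattern`,
# or 0 when the pattern does not occur in the text
extract_count <- function(text, pattern) {
  matches <- regmatches(text, regexec(pattern, text, ignore.case = TRUE, perl = TRUE))
  vapply(matches, function(m) if (length(m) >= 2) as.numeric(m[2]) else 0, numeric(1))
}

# Made-up award summaries in the assumed format
awards <- c(
  "Won 2 Oscars. Another 5 wins & 3 nominations.",
  "Nominated for 1 Oscar. Another 2 wins & 7 nominations.",
  "4 wins & 1 nomination.",
  "N/A"
)

# Headline counts follow "Won " / "Nominated for "; other counts precede "win(s)" / "nomination(s)"
wins <- extract_count(awards, "\\bWon (\\d+)") +
  extract_count(awards, "(\\d+) wins?\\b")
nominations <- extract_count(awards, "\\bNominated for (\\d+)") +
  extract_count(awards, "(\\d+) nominations?\\b")

data.frame(awards, wins, nominations)
```

Strings without any recognizable count, such as "N/A", yield 0 for both totals, which mirrors the `replace_na(..., 0)` step in the pipeline.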
  4. Integrating Director Gender for Analysis

    To explore the potential impact of a director’s gender on a film’s performance in the Bechdel Test, we extended our dataset by merging it with an external table that maps names to gender. This table associates names with specific genders, where:

    • Male = 1

    • Female = 0

    • Unisex = 3

    For the purposes of our analysis, we assume these name-gender associations are accurate (e.g., “James” is classified as male). Given that only 101 out of 1,792 records are marked as unisex or have missing gender values, we decided to exclude these records when examining gender’s impact on other variables.
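
The merge itself is not shown in this report. As a rough sketch of how such a lookup could work, the example below splits director names into a first and last name and joins them to a name-to-gender table. The table `name_gender`, its contents, and the toy director list are hypothetical; only the column names `Name`, `Last_Name`, and `Gender` and the coding (1 = male, 0 = female, 3 = unisex) follow the text.

```r
# Toy director data (illustrative values, not the project's data)
directors <- data.frame(
  Director = c("James Cameron", "Kathryn Bigelow", "Madonna"),
  stringsAsFactors = FALSE
)

# Split each name into a first name and a (possibly missing) last name
directors$Name <- sub(" .*$", "", directors$Director)
directors$Last_Name <- ifelse(
  grepl(" ", directors$Director),
  sub("^[^ ]+ ", "", directors$Director),
  NA_character_
)

# Hypothetical name-to-gender lookup (1 = male, 0 = female, 3 = unisex)
name_gender <- data.frame(
  Name = c("James", "Kathryn"),
  Gender = c(1, 0),
  stringsAsFactors = FALSE
)

# Left join on the first name: directors without a match keep Gender = NA
directors_gender <- merge(directors, name_gender, by = "Name", all.x = TRUE)
directors_gender
```

A left join keeps every director, so unmatched names surface as missing values that can be counted before deciding whether to exclude them.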

  5. Reconstructing the Director’s Full Name
    In this step, we recreate a column that combines the director’s first and last name. If the last name is missing, we use only the first name; otherwise, both names are concatenated. This provides a consistent format for the director’s name across the dataset, which we store in the new column Director_Name. We then remove the Name and Last_Name columns to streamline the dataset.
Show the code
# Recreate a column with the director's full name.
# Both name parts must come from the same data frame (df_final_gender).
df_final_gender$Director_Name <- ifelse(
  is.na(df_final_gender$Last_Name),
  df_final_gender$Name,
  paste(df_final_gender$Name, df_final_gender$Last_Name)
)

df_final <- df_final_gender %>%
  select(-Name, -Last_Name)
  6. Converting Bechdel Test Results to Binary Format
    We create a binary column, Bechdel_binary, to indicate whether each film passed the Bechdel Test, where a “PASS” result is represented as 1 and a “FAIL” result as 0. This binary format allows for easier analysis and statistical modeling.
Show the code
# Adding a column of Bechdel_test_result as binary
df_final$Bechdel_binary <- as.factor(ifelse(df_final$Bechdel_test_result == "PASS", 1, 0))
  7. Ensuring Correct Variable Types for Exploratory Data Analysis
    To prepare for exploratory data analysis, we set appropriate data types for each variable. Gender, Genre, Language, Country_of_Origin, Bechdel_test_result, and Bechdel_binary are converted to categorical variables, while Revenue, Imdb_Rating, and Imdb_Votes are transformed to numeric types for accurate calculations.
Variable              Type
Year                  integer
Bechdel_test_result   factor
Budget                integer
Revenue               numeric
Language              factor
Movie_Title           character
Country_of_Origin     factor
Imdb_Rating           numeric
Genre                 factor
Imdb_Votes            numeric
Wins                  numeric
Nominations           numeric
Gender                factor
Director_Name         character
Bechdel_binary        factor
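
The conversion code is not shown in this report. One possible way to do it in base R is sketched below on a toy data frame; the data frame `films` and its values are assumptions for illustration, while the column names follow the table.

```r
# Toy data frame using a few of the column names from the table (illustrative values)
films <- data.frame(
  Genre = c("Drama", "Action", "Drama"),
  Gender = c(1, 0, 1),
  Revenue = c("1000000", "2500000", NA),
  Imdb_Rating = c("7.1", "6.4", "8.0"),
  stringsAsFactors = FALSE
)

factor_cols <- c("Genre", "Gender")
numeric_cols <- c("Revenue", "Imdb_Rating")

# Convert several columns at once; missing values stay NA
films[factor_cols] <- lapply(films[factor_cols], as.factor)
films[numeric_cols] <- lapply(films[numeric_cols], as.numeric)

# Inspect the resulting types
sapply(films, class)
```

Listing the target columns in two vectors keeps the intended type of each variable in one place, which makes the step easy to audit.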
  8. Handling Missing Values
    After checking for missing values, we find that Revenue and Gender have some NA entries. Since these missing values could impact analysis, we remove rows with any NA values to maintain data integrity in the final dataset.
Show the code
na_counts <- colSums(is.na(df_final)) ## 32 NA in Revenue and 342 in Gender
df_final <- na.omit(df_final)
  9. Variables and Descriptions After Cleaning
  • Year (Release Date)
    Description: The year when the film was released to the public.
    Type: Integer.

  • Bechdel Test Result (PASS/FAIL)
    Description: This variable indicates whether the film passed or failed the Bechdel test, a measure of gender representation in films.
    Type: Categorical/factor with possible values “PASS” or “FAIL”.

  • Budget (in $)
    Description: The production budget of the film in U.S. dollars, adjusted for inflation to reflect 2013 dollar values.
    Type: Numeric, stored as an integer.

  • Revenue (in $)
    Description: The total revenue or box office earnings of the film in U.S. dollars, adjusted for inflation to 2013 prices.
    Type: Numeric.

  • Language
    Description: The primary language(s) in which the film was released or broadcast, represented by specific languages such as “English” or “French.”
    Type: Categorical/factor.

  • Movie Title
    Description: The name of the film, provided as text.
    Type: Text (Character).

  • Country of Origin
    Description: The country where the film was produced or primarily distributed, provided as the country’s name (e.g., “USA”, “France”).
    Type: Categorical/factor.

  • IMDb Ratings
    Description: The film’s IMDb rating, often based on user reviews and ratings on the IMDb platform.
    Type: Numeric.

  • Genre (Action, Drama, Comedy, etc.)
    Description: The primary genre(s) describing the film, such as Action, Drama, or Comedy.
    Type: Categorical/factor.

  • IMDb Votes
    Description: The number of user votes underlying the film’s IMDb rating on the IMDb platform.
    Type: Numeric.

  • Wins
    Description: The number of awards the film has won (e.g., “Golden Globe”, “Oscar”).
    Type: Numeric.

  • Nominations
    Description: The number of nominations for awards the film has received.
    Type: Numeric.

  • Gender (Director’s Gender)
    Description: The gender of the film’s director.
    Type: Categorical/factor.

  • Director Name
    Description: The name of the director of the film.
    Type: Text (Character).

  • Bechdel Binary
    Description: This variable indicates whether the film passed (1) or failed (0) the Bechdel test.
    Type: Categorical/factor with possible values 1 (passed) or 0 (failed).

1.6 Exploratory Data Analysis

This exploratory analysis aims to uncover relationships between female representation in films, as measured by the Bechdel test, and their commercial success as indicated by revenue and awards. By visualizing the data and assessing correlations among these variables, we can gain insights into the impact of diversity in film on its critical and financial performance.

  1. Creating a Deduplicated Dataset for Analysis
    In this step, we create a deduplicated dataset based on unique film titles. This dataset will be used for analysis in all columns except for Genre and Language. By grouping the data by Movie_Title and selecting the first occurrence of each variable, we ensure that only one entry per film is considered for analysis. We then summarize the dataset to count the number of distinct values in each column, providing an overview of the diversity within the data.
Show the code
# Create a deduplicated dataset based on unique film titles
df_unique <- df_final %>% 
  group_by(Movie_Title) %>% 
  summarise(across(everything(), ~ first(.)))

# Summarize the number of distinct values for each column
summary_data <- df_unique %>%
  summarise(across(everything(), ~ n_distinct(.)))

# Convert the summary data to an interactive table using DT
summary_data_table <- datatable(summary_data, options = list(pageLength = 5))

# Print the interactive table
summary_data_table
  2. Summarizing Categorical Variables
    In this step, we summarize the categorical variables Bechdel_test_result, Genre, and Gender by counting the occurrences of each unique category. This provides an overview of the distribution of these variables in the dataset, which is essential for understanding their role in the analysis.
Show the code
# Summarize the Bechdel test result, Genre, and Gender with counts
bechdel_test_summary <- df_unique %>%
  count(Bechdel_test_result, name = "count") %>%
  rename(Category = Bechdel_test_result) %>%
  mutate(Variable = "Bechdel_test_result") # Add the variable name

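# Count genres from df_final (one row per film and genre),
# because df_unique keeps only the first genre listed for each film
genre_summary <- df_final %>%
  distinct(Movie_Title, Genre) %>%
  count(Genre, name = "count") %>%
  rename(Category = Genre) %>%
  mutate(Variable = "Genre of the film") # Add the variable name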

gender_summary <- df_unique %>%
  count(Gender, name = "count") %>%
  rename(Category = Gender) %>%
  mutate(Variable = "Gender of Directors") # Add the variable name

# Create interactive tables with DT
bechdel_test_table <- datatable(bechdel_test_summary, options = list(pageLength = 5))
genre_summary_table <- datatable(genre_summary, options = list(pageLength = 5))
gender_summary_table <- datatable(gender_summary, options = list(pageLength = 5))

# Print the interactive tables
bechdel_test_table
genre_summary_table
gender_summary_table
  3. Summarizing Numerical Variables
    In this step, we summarize the key numerical columns, including Budget, Revenue, Imdb_Rating, Imdb_Votes, Wins, and Nominations. This summary provides descriptive statistics for these variables, offering insights into their distribution and range, which are essential for understanding their role in the analysis of film performance.
Show the code
# Summarize the numerical columns
numerical_summary <- summary(df_unique[, c("Budget", "Revenue", "Imdb_Rating", "Imdb_Votes", "Wins", "Nominations")])

# Convert the summary into a data frame for easier viewing
numerical_summary_df <- as.data.frame(numerical_summary)

# Convert the numerical summary data to an interactive table using DT
numerical_summary_table <- datatable(numerical_summary_df, options = list(pageLength = 6))

# Print the interactive table
numerical_summary_table

1.6.1 Correlation Heatmap: Revenue, Budget, Imdb_Rating, Wins, Nominations, Bechdel_binary

In this step, we present a correlation heatmap that displays the relationships between key variables: Revenue, Budget, IMDb Rating, Wins, Bechdel_binary and Nominations. By examining the correlation coefficients between these variables, this heatmap provides insights into the degree and direction of association among financial, critical, and award-related metrics. Positive correlations indicate that as one variable increases, the other tends to increase as well, while negative correlations suggest an inverse relationship. This visualization serves as a preliminary analysis, helping us identify which factors are most closely linked, and providing a basis for further exploration of the economic and critical impacts of female representation in films.

Show the code
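# Select relevant columns including Bechdel_binary.
# Bechdel_binary is a factor, so convert via character to keep its 0/1 coding
# (as.numeric() on a factor would return the level indices 1 and 2 instead)
num_cols <- df_unique %>% 
  select(Revenue, Budget, Imdb_Rating, Wins, Nominations, Bechdel_binary)  %>%
  mutate(across(everything(), ~ as.numeric(as.character(.x))))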

# Calculate the correlation matrix with complete observations only
corr_matrix <- cor(num_cols, use = "complete.obs")

# Convert correlation matrix to long format for ggplot
corr_long <- corr_matrix %>%
  as.data.frame() %>%
  rownames_to_column(var = "Var1") %>%
  pivot_longer(cols = -Var1, names_to = "Var2", values_to = "value")

# Create the heatmap with correlation values displayed on the tiles
heatmap <- ggplot(corr_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limits = c(-1, 1)) +
  geom_text(aes(label = round(value, 2)), color = "black", size = 4) +  # Display correlation values
  labs(title = "Correlation Heatmap: Revenue, Budget, Imdb_Rating, Wins, Nominations, Bechdel Test", 
       x = "", y = "", fill = "Correlation") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),  # Center the title
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Make the heatmap interactive with Plotly
ggplotly(heatmap)

1.6.2 Interpretation:

1.6.2.1 High Positive Correlations:

  1. Budget and Revenue (0.60): There is a fairly strong positive correlation between budget and revenue. This suggests that films with higher production budgets tend to generate higher box office revenues. A plausible explanation is that higher budgets allow for more marketing, higher production values, and more star power, all of which may contribute to greater commercial success.

  2. Wins and Nominations (0.79): A very strong positive correlation, indicating that films with more nominations generally win more awards. This highlights the likelihood that films nominated for prestigious awards are often the ones that take home the honors.

1.6.2.2 Moderate Correlations:

  1. IMDb Rating and Nominations (0.48): There is a moderate positive correlation between IMDb ratings and nominations. Films with better ratings from audiences tend to receive more nominations, suggesting that films audiences rate highly are also more likely to gain recognition in award circles.

1.6.2.3 Weak or No Correlations:

  1. Budget and IMDb Rating (-0.04): There is virtually no correlation between the budget of a film and its IMDb rating, indicating that the amount spent on production does not necessarily correlate with higher user ratings. Audience ratings are influenced by various other factors, such as the story, acting, and direction.

  2. Wins and IMDb Rating (0.06): The correlation between wins and IMDb ratings is very weak, suggesting that winning awards does not strongly impact the audience’s ratings. A film can be successful in terms of awards but may not necessarily receive high ratings from the audience.

  3. Bechdel Binary (0.00 to -0.15 with other variables): The Bechdel binary variable (indicating whether a film passed or failed the Bechdel test) shows only weak correlations with the other variables, suggesting that passing the Bechdel test has little linear association with budget, revenue, IMDb rating, or wins. Note that a weak correlation is not the same as a statistically non-significant one: in a sample of this size, even small correlations can differ reliably from zero. For example:

    • Bechdel_binary and Revenue (-0.10): A very weak negative correlation, meaning that whether a film passes the Bechdel test is only marginally associated with its revenue.

    • Bechdel_binary and Budget (-0.15): A weak negative correlation, showing that films passing the Bechdel test tend to have slightly smaller budgets, though the relationship is not strong.

    • Bechdel_binary and IMDb Rating (-0.13): A weak negative correlation, indicating that films passing the Bechdel test tend to have slightly lower IMDb ratings.
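
To illustrate the difference between the strength of a correlation and its statistical significance, the sketch below computes the usual test statistic for a Pearson correlation, t = r · sqrt(n − 2) / sqrt(1 − r²). The sample size is an assumption: it uses n = 948 + 754 = 1,702, the group counts reported in the Wins summary of this report, and the number of complete observations behind the heatmap may differ.

```r
r <- -0.15        # Bechdel_binary vs. Budget, as read off the heatmap
n <- 948 + 754    # assumed sample size (films failing + films passing)

# t statistic and two-sided p-value for H0: true correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of variance in one variable linearly associated with the other
r_squared <- r^2

c(t = t_stat, p = p_value, r_squared = r_squared)
```

Under these assumptions the correlation would be clearly distinguishable from zero while accounting for only about 2% of the variance, which is why "weak" is the more accurate description than "not significant".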

1.6.3 Revenue Distribution by Bechdel Test Outcome

In this step, we examine the distribution of movie revenues based on whether films pass or fail the Bechdel Test. Using a violin plot overlaid with boxplots, we visualize the log-transformed revenue to handle any large variances in revenue data, which allows us to better observe patterns across Bechdel Test outcomes. This visualization helps us assess whether movies that pass the Bechdel Test tend to have different revenue distributions compared to those that fail, providing preliminary insights into the financial impact of female representation in films.

Show the code
# Drop rows with a missing Revenue value before plotting
df_unique <- df_unique %>% drop_na(Revenue)

# Create the interactive plot with specified colors
plot <- ggplot(df_unique, aes(x = factor(Bechdel_binary), y = log(Revenue), fill = factor(Bechdel_binary))) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, alpha = 0.8, outlier.shape = NA) +
  scale_fill_manual(values = c("0" = "#FF7F50", "1" = "#4CAF50"), labels = c("Fail", "Pass")) +
  scale_x_discrete(labels = c("0" = "Fail", "1" = "Pass")) +
  labs(
    title = "Movie Revenue Distribution by Bechdel Test Result",
    subtitle = "Log of Revenue",
    x = "Bechdel Test Result",
    y = "Revenue (Log Scale)",
    fill = "Bechdel Test Result\n(0 = Fail, 1 = Pass)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 10),
    legend.position = "bottom",
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12)
  )


ggplotly(plot)

1.6.4 Interpretation:

The violin plot illustrates the revenue distribution for films based on their Bechdel Test results. Films that fail the Bechdel Test generally show a higher median revenue compared to those that pass. Additionally, the revenue distribution for failing films has greater variability, with some reaching notably high revenues, while others fall to lower extremes, as shown by the extended tails of the violin plot. In contrast, films that pass the Bechdel Test have a more condensed revenue distribution, indicating less variability and a slightly lower median revenue.

Overall, this suggests that films failing the Bechdel Test may achieve higher revenues on average, though the reasons behind this trend likely involve other factors, such as genre or marketing, which would require further analysis to clarify.

1.6.5 Revenue Distribution by Genre and Bechdel Test Outcome

In this section we use a boxplot that compares film revenues across different genres, with results segmented by whether the films pass or fail the Bechdel Test.

Show the code
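# Interactive plot. Keep one row per film and genre so that films with
# several languages or directors are not counted more than once within a genre
plot <- ggplot(distinct(df_final, Movie_Title, Genre, .keep_all = TRUE), aes(x = Genre, y = Revenue, fill = factor(Bechdel_binary))) +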
  geom_boxplot(outlier.shape = NA, alpha = 0.8) +  # Set transparency for better visibility
  scale_y_log10() +  # Log scale to handle revenue variability
  labs(
    title = "Revenue Distribution by Genre and Bechdel Test Result",
    subtitle = "Comparing Revenue Across Genres by Bechdel Test Outcome",
    x = "Genre", 
    y = "Revenue (Log Scale)", 
    fill = "Bechdel Test Result\n(0 = Fail, 1 = Pass)"
  ) +
  scale_fill_manual(values = c("0" = "#FF7F50", "1" = "#4CAF50"),  # Orange for Fail, Green for Pass
                    labels = c("Fail", "Pass")) +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),  # Centered title
    plot.subtitle = element_text(hjust = 0.5, size = 10),  # Centered subtitle
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),  # Rotate genre labels
    axis.title = element_text(size = 12),
    legend.position = "bottom"  # Move legend to bottom for more space
  )

# Convert to interactive plot
ggplotly(plot)

1.6.6 Interpretation:

The plot suggests that while revenue varies significantly across genres, the Bechdel Test result does not have a consistent impact on revenue within each genre. Certain genres like Adventure, Sci-Fi, and Action show generally higher revenues, but this trend appears irrespective of the Bechdel Test outcome.

1.6.7 Revenue vs. IMDb Rating Scatter Plot with Bechdel Test Outcome

This scatter plot visualization provides insight into the relationship between a film’s IMDb rating and its revenue, while highlighting the results of the Bechdel Test as a key variable. It aims to observe any trends in rating and revenue associated with gender representation. A smooth trend line added to the data helps to highlight overarching patterns, while a logarithmic scale on the revenue axis accounts for the large revenue disparities often found across films.

Show the code
# Load necessary libraries
library(ggplot2)
library(plotly)

# Scatter plot of revenue against IMDb rating, colored by Bechdel test result
p <- ggplot(df_unique, aes(x = Imdb_Rating, y = Revenue, color = factor(Bechdel_binary))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE) +  # Add trend lines without confidence intervals
  scale_y_log10() + 
  scale_color_manual(values = c("0" = "#FF7F50", "1" = "#4CAF50")) +  # Set colors for Bechdel_binary
  labs(title = "Revenue vs. IMDb Rating by Bechdel Test Result",
       x = "IMDb Rating", y = "Log of Revenue", color = "Bechdel Test Result\n(0 = Fail, 1 = Pass)") +
  theme_minimal()

ggplotly(p)

Passing the Bechdel Test does not appear to have a major impact on revenue, as passing and failing films follow similar trends in IMDb rating and revenue. Instead, the IMDb rating itself seems to be the stronger indicator of revenue potential across both groups, though this effect is relatively modest.

1.6.8 Wins by Bechdel Test Result

Show the code
# Using dplyr to create summary statistics
summary_stats <- df_unique %>%
  group_by(Bechdel_binary) %>%
  summarise(
    Mean_Wins = mean(Wins, na.rm = TRUE),
    Median_Wins = median(Wins, na.rm = TRUE),
    Count = n()
  ) %>%
  mutate(
    Mean_Wins = round(Mean_Wins, 2),
    Median_Wins = round(Median_Wins, 2)
  )

summary_stats
# A tibble: 2 × 4
  Bechdel_binary Mean_Wins Median_Wins Count
  <fct>              <dbl>       <dbl> <int>
1 0                   8.94           3   948
2 1                   8.81           3   754

We see that failing the test is more common (948 films fail versus 754 that pass). The average number of wins is nearly identical in the two groups (8.94 for failing films versus 8.81 for passing films), and the median is 3 in both: half of the films in each group won at least 3 awards.

1.6.9 IMDb Ratings by Bechdel Test Result

Show the code
# Create the interactive plot
plot <- ggplot(df_unique, aes(x = factor(Bechdel_binary), y = Imdb_Rating, fill = factor(Bechdel_binary))) +
  geom_boxplot(outlier.shape = NA) +
  labs(title = "IMDb Ratings by Bechdel Test Result",
       x = "Bechdel Test Result", y = "IMDb Rating",  fill = "Bechdel Test Result\n(0 = Fail, 1 = Pass)") +
  scale_fill_manual(values = c("0" = "#FF7F50", "1" = "#4CAF50")) +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(plot)

The plot suggests that films that failed the test have higher IMDb ratings. We can test statistically whether this difference is significant.

Show the code
# Separate IMDb ratings into two groups based on Bechdel Test result
fail_ratings <- df_unique$Imdb_Rating[df_unique$Bechdel_binary == 0]
pass_ratings <- df_unique$Imdb_Rating[df_unique$Bechdel_binary == 1]

# Perform independent samples t-test
t_test_result <- t.test(pass_ratings, fail_ratings, alternative = "greater")

# Print the t-test result
print(t_test_result)

    Welch Two Sample t-test

data:  pass_ratings and fail_ratings
t = -5.5603, df = 1621.6, p-value = 1
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -0.3307857        Inf
sample estimates:
mean of x mean of y 
 6.629045  6.884283 

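The test as specified (alternative = "greater") asks whether passing films have higher mean ratings than failing films, and the p-value of 1 gives no support for that hypothesis. The negative t statistic (t = -5.56) and the sample means (6.63 for passing versus 6.88 for failing films) point in the opposite direction: in this sample, films that fail the Bechdel test have higher IMDb ratings on average. Formally confirming this requires running the test with alternative = "less" (or a two-sided alternative).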
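
As a rough check, the one-sided p-value in the opposite direction can be recovered from the figures printed in the Welch test output, without rerunning the test. The sketch below uses only the reported t statistic, degrees of freedom, and group means; the variable names are illustrative.

```r
# Values copied from the Welch test output in this section
t_stat <- -5.5603
df_welch <- 1621.6
mean_pass <- 6.629045
mean_fail <- 6.884283

# p-value for H1: pass > fail (the alternative that was actually tested)
p_greater <- pt(t_stat, df_welch, lower.tail = FALSE)

# p-value for H1: pass < fail (the direction suggested by the boxplot)
p_less <- pt(t_stat, df_welch)

# Upper bound of the one-sided 95% interval for mean(pass) - mean(fail)
std_error <- (mean_pass - mean_fail) / t_stat
upper_bound <- (mean_pass - mean_fail) + qt(0.95, df_welch) * std_error

c(p_greater = p_greater, p_less = p_less, upper_bound = upper_bound)
```

If the reported statistics are taken at face value, the p-value for the "less" alternative is very small and the upper bound of the interval lies below zero, which would support the reading that failing films are rated higher on average.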

1.6.10 Director’s gender impact

This section will examine how the director’s gender influences various aspects of the data, including budget, revenue, IMDb score, and the Bechdel test results.

First, we create a copy of df_unique that contains only the rows where the director’s gender is identified as female or male, which means removing the unisex (3) value.

Show the code
# Create a new dataframe with only Gender 0 and 1
df_gender <- df_unique %>% 
  filter(Gender %in% c(0, 1))

# Display the first few rows to check the result
# head(df_gender)

df_gender <- df_gender %>%
  mutate(Director_gender = ifelse(Gender == 1, "Male", "Female"))

In our dataset, only 7% of movies were directed by female directors, highlighting a significant gender disparity in the film industry. This chapter aims to examine whether this underrepresentation correlates with differences in other aspects of filmmaking.

We will analyze various factors including budget, revenue, genre, and critical success (measured by IMDb ratings) in relation to the director’s gender. Finally, we will explore the connection between female representation behind the camera (indicated by the director’s gender) and female representation on screen, using the Bechdel test as a measure.

The objective is to assess the broader impact of director gender on a movie's production and success and on the portrayal of gender in film.

1.6.11 The director’s gender impact on budget

1.6.12 Interpretation : Budget VS Gender

This boxplot shows that male directors receive higher budgets than their female counterparts.

# Run a t-test to compare the mean log-transformed budget between male and female directors
t_test_result <- t.test(log(Budget) ~ Director_gender, data = df_gender)

# Display the result
# t_test_result

The t-test is run on log-transformed budgets, so it compares mean log-budgets between movies directed by male and female directors. The difference is statistically significant, with male-directed movies receiving, on average, a higher budget. The p-value (< 0.0001) and the confidence interval for the female-minus-male difference (entirely below 0) both support this conclusion, suggesting a meaningful difference in budget allocation based on the director’s gender.

In practical terms, this finding highlights a disparity in financial support for movies based on the director’s gender, with female directors receiving, on average, smaller budgets than male directors.
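A difference in mean log-budgets can be back-transformed into a ratio of geometric mean budgets, which is easier to read than a difference on the log scale. The sketch below shows the idea on simulated data: the data frame `demo` and its parameters are assumptions for illustration and do not reproduce the project's numbers.

```r
# Illustrative data only: simulated budgets, not the project's df_gender
set.seed(42)
demo <- data.frame(
  Director_gender = rep(c("Female", "Male"), times = c(50, 300)),
  Budget = c(rlnorm(50, meanlog = 16.5, sdlog = 1),
             rlnorm(300, meanlog = 17.2, sdlog = 1))
)

fit <- t.test(log(Budget) ~ Director_gender, data = demo)

# Difference in mean log-budgets (Female minus Male, alphabetical group order)
log_diff <- unname(fit$estimate[1] - fit$estimate[2])

# Back-transform: ratio of geometric mean budgets and its 95% interval
ratio    <- exp(log_diff)
ratio_ci <- exp(fit$conf.int)

ratio
ratio_ci
```

A ratio below 1 with an interval that excludes 1 would mean that the typical female-directed budget is a fraction of the typical male-directed budget.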

1.6.13 The director’s gender impact on Revenue

# Calculate mean and median revenue in millions
revenue_summary <- df_gender %>%
  group_by(Director_gender) %>%
  summarize(
    Mean_Revenue_Millions = mean(Revenue / 1e6, na.rm = TRUE),
    Median_Revenue_Millions = median(Revenue / 1e6, na.rm = TRUE),
    Count = n()
  ) %>%
  pivot_longer(cols = c(Mean_Revenue_Millions, Median_Revenue_Millions), names_to = "Statistic", values_to = "Revenue_Millions")

# Set unique labels for combinations of Director_gender and Statistic for color mapping
revenue_summary <- revenue_summary %>%
  mutate(Color_Label = paste(Director_gender, Statistic, sep = "_"))

# Plot the mean and median revenue in millions with custom colors
plot <- ggplot(revenue_summary, aes(x = as.factor(Director_gender), y = Revenue_Millions, fill = Color_Label)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    x = "Director Gender",
    y = "Revenue (in Millions)",
    title = "Mean and Median Revenue by Director Gender (in Millions)",
    fill = "Statistic"
  ) +
  scale_fill_manual(values = c(
    "Female_Mean_Revenue_Millions" = "#DB9ADB",
    "Female_Median_Revenue_Millions" = "#e35fe3",
    "Male_Mean_Revenue_Millions" = "#4062DB",
    "Male_Median_Revenue_Millions" = "#0d1bdb"
  )) +
  theme_minimal()


# Convert the plot to an interactive plot
interactive_plot <- ggplotly(plot)

# Display the interactive plot
interactive_plot

1.6.14 Interpretation Revenue VS Gender

# Run a t-test to compare the mean revenue between male and female directors
t_test_result <- t.test(Revenue ~ Director_gender, data = df_gender)
# Display the result
# t_test_result

The bar plot reveals a substantial disparity in movie revenues between male and female directors, indicating that films directed by women generate, on average, about half the revenue of those directed by men. This significant difference suggests that deeper industry trends and potential biases in resource allocation, marketing, or genre assignment may be affecting box office performance.

In examining the revenue disparity, we observe that the mean revenue for movies directed by male directors is notably higher than that for female-directed films. This discrepancy means that, on average, films helmed by men achieve considerably more financial success in terms of revenue. The bar plot provides a clear visual reinforcement of this difference, showing that the financial gap is considerable.

Statistical testing further supports the significance of this revenue difference, confirming that it is highly unlikely to have occurred by random chance. The t-test’s low p-value (<0.001) indicates that this gap is consistent enough across the sample to suggest a systematic pattern rather than an isolated anomaly. The implication here is that gender-based revenue differences are reflective of broader trends in the film industry, not just singular outliers.

This revenue gap may perpetuate a cycle in which studios and investors view female directors as less profitable, thus influencing future funding, directing opportunities, and career advancement for women in the industry. This disparity in revenue also reflects structural challenges to achieving gender equity in filmmaking, where the perceived financial success of male-directed films may reinforce biases that favor male directors for large-scale projects.

1.6.15 The director’s gender impact on Critical success

plot <- ggplot(
  data = df_gender,
  mapping = aes(x = Director_gender, y = Imdb_Rating, fill = Director_gender)
) +
  geom_boxplot(outlier.shape = NA) +  # Hide outliers if they clutter the plot
  scale_fill_manual(values = c("Female" = "#DB9ADB", "Male" = "#4062DB")) +  # Colors for female (Gender 0) and male (Gender 1) directors
  labs(
    title = "IMDb Rating Distribution by Director Gender",
    x = "Director's Gender",
    y = "IMDb Rating",
    fill = "Gender"
  ) +
  theme_minimal()

# Convert the plot to an interactive plot
interactive_plot <- ggplotly(plot)

# Display the interactive plot
interactive_plot

1.6.16 Interpretation Critical success VS Gender

# Run a t-test to compare the mean IMDb rating between male and female directors
t_test_result <- t.test(Imdb_Rating ~ Director_gender, data = df_gender)
# Display the result
# t_test_result

Director gender is associated with a statistically significant difference in IMDb ratings (p < 0.001). A p-value this low means that a difference of this size would be very unlikely if male- and female-directed films had the same mean rating, so the gap reflects a consistent pattern rather than chance alone.

1.6.17 The director’s gender impact on the Bechdel test

# Calculate counts and percentages by Gender and Bechdel Test result
data_gender_bechdel <- df_gender %>%
  group_by(Director_gender, Bechdel_binary) %>%
  summarize(count = n(), .groups = "drop") %>%
  group_by(Director_gender) %>%
  mutate(percentage = count / sum(count) * 100)

# Plot
plot <- ggplot(data_gender_bechdel, aes(x = as.factor(Director_gender), y = count, fill = as.factor(Bechdel_binary)))  +
  geom_bar(stat = "identity", width = 0.5, position = "fill") +  # Use position = "fill" for percentage stack
  labs(
    x = "Director Gender",  # Label for x-axis
    y = "Percentage of Movies",                    # Label for y-axis
    title = "Percentage of Films Passing the Bechdel Test by Director Gender",
    fill = "Bechdel Test Result\n(0 = Fail, 1 = Pass)"
  ) +
  scale_fill_manual(values = c("0" = "#FF7F50", "1" = "#4CAF50")) +  # Customize colors for pass/fail
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10)
  ) +
  geom_text(
    aes(label = paste0(round(percentage, 1), "%")),
    position = position_fill(vjust = 0.5),  # Center text within each stack
    color = "white",
    size = 3
  )

# Convert the plot to an interactive plot
interactive_plot <- ggplotly(plot)

# Display the interactive plot
interactive_plot

Unsurprisingly, films directed by women pass the Bechdel Test more often than films directed by men.

# Count Bechdel_binary values grouped by Gender
counts <- df_unique %>%
  group_by(Gender, Bechdel_binary) %>%
  summarise(count = n(), .groups = 'drop') %>%
  mutate(
    Gender_desc = case_when(
      Gender == 0 ~ "Female",
      Gender == 1 ~ "Male",
      Gender == 3 ~ "Unisex"
    ),
    Bechdel_binary_desc = ifelse(Bechdel_binary == 0, "Fail", "Pass")
  ) %>%
  select(Gender_desc, Bechdel_binary_desc, count)  # Select only the semantic columns

# Display the results with semantic descriptions on the left 
# print(counts)
# Create a formatted table with gt
counts_table <- counts %>%
  gt() %>%
  tab_header(
    title = "Bechdel Test Results by Director Gender"
  ) %>%
  cols_label(
    Gender_desc = "Director Gender",
    Bechdel_binary_desc = "Bechdel Test Result",
    count = "Count"
  ) %>%
  fmt_number(
    columns = c(count),
    decimals = 0  # No decimals for count
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )

# Display the table
counts_table
Bechdel Test Results by Director Gender

Director Gender   Bechdel Test Result   Count
Female            Fail                     23
Female            Pass                     90
Male              Fail                    878
Male              Pass                    615
Unisex            Fail                     47
Unisex            Pass                     49

1.6.18 Interpretation Bechdel test VS Gender

# Run a chi-squared test to compare Bechdel test results by Director gender
table_data <- table(df_gender$Director_gender, df_gender$Bechdel_binary)
chi_squared_test <- chisq.test(table_data)

# Display the result
# chi_squared_test

The identity of a movie’s director is associated with nearly every aspect of the film, from its budget to its financial and critical success. Who directs a movie also influences the type of story that gets told. Female-directed films show women interacting on screen far more often: 80% of movies directed by women (90 of 113) pass the Bechdel Test. The test asks whether a film has at least two women who talk to each other about something other than a man, so this gap highlights an important difference in storytelling based on the director’s gender.

In contrast, only about 41% of movies directed by men (615 of 1,493) pass the Bechdel Test, suggesting a less consistent focus on female representation. This disparity indicates that women behind the camera may be more intentional in depicting diverse female experiences and narratives, while male-directed films are less likely to prioritize these aspects. Given that men direct 93% of the movies in our sample, this lack of emphasis on female representation has a profound impact on the industry. If the sample is a reasonable representation of the broader industry, this overwhelming majority means that stories lacking meaningful female perspectives remain the norm, shaping the cultural landscape and the types of narratives that audiences are exposed to.

Ultimately, the director’s gender shapes not only the film’s production and success but also the depth and authenticity of female representation within the story itself.
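The pass rates quoted in this section can be recomputed directly from the counts in the table "Bechdel Test Results by Director Gender". The sketch below rebuilds the 2x2 table by hand (unisex rows excluded, as in df_gender) and runs the same kind of chi-squared test. The object names `bechdel_counts` and `chi_squared_demo` are introduced here for illustration only.

```r
# Counts copied from the table "Bechdel Test Results by Director Gender"
bechdel_counts <- matrix(
  c(23,  90,
    878, 615),
  nrow = 2, byrow = TRUE,
  dimnames = list(Director_gender = c("Female", "Male"),
                  Bechdel = c("Fail", "Pass"))
)

# Pass rate per director gender: roughly 0.80 for women and 0.41 for men
pass_rate <- bechdel_counts[, "Pass"] / rowSums(bechdel_counts)
round(pass_rate, 3)

# Chi-squared test of independence
# (chisq.test applies the Yates continuity correction to 2x2 tables by default)
chi_squared_demo <- chisq.test(bechdel_counts)
chi_squared_demo$expected  # counts expected if pass rate did not depend on gender
chi_squared_demo$p.value
```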

1.6.19 Trend Analysis Over Time (Percentage of Movies Passing Bechdel Test)

# Calculate the percentage of movies passing the Bechdel test by year
df_trend <- df_unique %>%
  group_by(Year) %>%
  summarize(pass_rate = mean(Bechdel_binary == 1, na.rm = TRUE) * 100)

# Create the interactive plot
plot <- ggplot(df_trend, aes(x = Year, y = pass_rate)) +
  geom_line(color = "blue") +
  geom_point(color = "blue") +
  labs(title = "Percentage of Movies Passing the Bechdel Test Over Time",
       x = "Year", y = "Percentage Passing Bechdel Test") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(plot)

This final graph shows that the percentage of films passing the Bechdel Test hasn’t increased over time, despite growing calls for better representation. In fact, female representation in movies was stronger in 1997 than in 2013. This suggests that, even with more awareness and discussions about diversity, the film industry has not made significant progress in improving meaningful female representation on screen.
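The reading of the trend plot could be backed by a formal test, for example a logistic regression of the pass/fail indicator on release year. The sketch below is a hedged illustration on simulated films: the year range, the sample size, and the constant pass probability are assumptions, not properties of df_unique.

```r
# Illustrative data only: simulated films, not the project's df_unique
set.seed(7)
demo <- data.frame(Year = sample(1990:2013, 1500, replace = TRUE))
demo$Bechdel_binary <- rbinom(nrow(demo), size = 1, prob = 0.45)

# Logistic regression of pass/fail on release year
trend_model <- glm(Bechdel_binary ~ Year, data = demo, family = binomial)

# Slope on the log-odds scale and its p-value
summary(trend_model)$coefficients["Year", c("Estimate", "Pr(>|z|)")]

# Multiplicative change in the odds of passing per additional year
year_odds_ratio <- exp(coef(trend_model)[["Year"]])
year_odds_ratio
```

An odds ratio close to 1 with a non-significant slope would be consistent with a flat trend. Run on the real data, the same model would show whether the visual impression holds.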

1.7 Analysis

In the analysis phase, we apply statistical models to investigate the relationships between passing the Bechdel Test and various film performance metrics. Specifically, we test whether passing the Bechdel Test is significantly associated with higher revenue, more award wins, and higher IMDb ratings. We also build predictive models that estimate a film’s likelihood of passing the Bechdel Test from features such as budget, genre, and director.

  1. Correlation Analysis for Bechdel Test Result and Revenue

    This code performs a Pearson correlation test to determine if there is a statistically significant relationship between Bechdel_test_result and Revenue.

    # Convert columns to numeric if necessary
    df_unique$Bechdel_test_result <- as.numeric(df_unique$Bechdel_test_result)
    df_unique$Revenue <- as.numeric(df_unique$Revenue)
    
    # Perform correlation test again
    correlation_test <- cor.test(df_unique$Bechdel_test_result, df_unique$Revenue, method = "pearson")
    print(correlation_test)
    
        Pearson's product-moment correlation
    
    data:  df_unique$Bechdel_test_result and df_unique$Revenue
    t = -3.9385, df = 1700, p-value = 8.532e-05
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     -0.14196204 -0.04779078
    sample estimates:
            cor 
    -0.09508915 

    The Pearson correlation test between Bechdel_test_result and Revenue yielded a correlation coefficient of approximately -0.095, with a p-value of 8.532e-05. This result indicates a small but statistically significant negative correlation between passing the Bechdel Test and revenue. The 95% confidence interval for this correlation ranges from -0.142 to -0.048, suggesting that the true correlation is likely weakly negative.

    1.7.0.1 Key Findings:

    • Correlation Value: The correlation coefficient of -0.095 implies a weak inverse relationship, where films that pass the Bechdel Test tend to have slightly lower revenues on average. However, given the small magnitude of this correlation, the effect is minor.

    • Statistical Significance: The p-value (< 0.0001) shows that this negative correlation is statistically significant, meaning it is unlikely to be due to random chance.

    1.7.0.2 Conclusion:

    While there is a statistically significant relationship between passing the Bechdel Test and revenue, the effect size is minimal, suggesting that passing the test is not a strong predictor of a film’s financial performance. This finding implies that other factors (such as budget, genre, and marketing) likely play a much more substantial role in determining a film’s revenue than its performance on the Bechdel Test.
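If Bechdel_test_result is coded 0/1, the Pearson test above is a point-biserial correlation, and it is equivalent to a two-sample t-test with pooled variance. The sketch below illustrates that equivalence on simulated data (the vectors `passed` and `revenue` are assumptions, not the project's data) and adds a rank-based test, which may be worth considering because revenue is strongly right-skewed.

```r
# Illustrative data only: simulated films, not the project's df_unique
set.seed(123)
passed  <- rbinom(500, size = 1, prob = 0.45)
revenue <- rlnorm(500, meanlog = 18, sdlog = 1.2)

ct <- cor.test(passed, revenue, method = "pearson")
tt <- t.test(revenue ~ passed, var.equal = TRUE)

# Same test statistic up to sign, and the same p-value
c(cor_t = unname(ct$statistic), ttest_t = unname(tt$statistic))
c(cor_p = ct$p.value, ttest_p = tt$p.value)

# Rank-based alternative that is less sensitive to a heavy right tail
wilcox.test(revenue ~ passed)
```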

Logistic Regression for Awards and IMDb Ratings

In this section we build two models to determine whether passing the Bechdel Test is a predictor of critical acclaim. The award_model uses logistic regression for the binary awards outcome, and the imdb_rating_model uses linear regression to estimate the IMDb rating as a continuous variable.

# Create a binary variable `awards` indicating whether the film won any awards (1 if Wins > 0, else 0)
df_unique$awards <- ifelse(df_unique$Wins > 0, 1, 0)

# Logistic regression for binary outcome (awards)
award_model <- glm(awards ~ Bechdel_test_result + Budget + Genre, data = df_unique, family = binomial)
summary(award_model)

Call:
glm(formula = awards ~ Bechdel_test_result + Budget + Genre, 
    family = binomial, data = df_unique)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          3.843e-01  2.321e-01   1.656 0.097760 .  
Bechdel_test_result -9.735e-02  1.265e-01  -0.770 0.441407    
Budget               9.458e-09  1.559e-09   6.066 1.31e-09 ***
GenreAdventure       3.977e-01  2.605e-01   1.526 0.126923    
GenreAnimation       7.771e-01  3.027e-01   2.567 0.010264 *  
GenreBiography       3.009e+00  7.293e-01   4.126 3.70e-05 ***
GenreComedy          4.997e-01  1.731e-01   2.886 0.003897 ** 
GenreCrime           9.759e-01  2.919e-01   3.344 0.000827 ***
GenreDocumentary     9.244e-01  1.133e+00   0.816 0.414423    
GenreDrama           1.475e+00  2.261e-01   6.524 6.84e-11 ***
GenreFamily         -4.397e-01  1.424e+00  -0.309 0.757469    
GenreFantasy         6.641e-01  8.053e-01   0.825 0.409580    
GenreHorror          2.132e-01  2.613e-01   0.816 0.414598    
GenreMusic           1.384e+01  8.827e+02   0.016 0.987490    
GenreMusical         1.426e+01  6.238e+02   0.023 0.981765    
GenreMystery        -1.023e-01  5.805e-01  -0.176 0.860056    
GenreSci-Fi         -2.543e-01  9.349e-01  -0.272 0.785649    
GenreThriller       -1.599e+00  1.276e+00  -1.253 0.210250    
GenreWestern        -1.563e+01  8.827e+02  -0.018 0.985872    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1784.8  on 1701  degrees of freedom
Residual deviance: 1663.0  on 1683  degrees of freedom
AIC: 1701

Number of Fisher Scoring iterations: 13
# Linear regression for continuous IMDb ratings
imdb_rating_model <- lm(Imdb_Rating ~ Bechdel_test_result + Budget + Genre, data = df_unique)
summary(imdb_rating_model)

Call:
lm(formula = Imdb_Rating ~ Bechdel_test_result + Budget + Genre, 
    data = df_unique)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8811 -0.5030  0.0438  0.5940  2.3965 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          6.863e+00  8.433e-02  81.381  < 2e-16 ***
Bechdel_test_result -2.748e-01  4.490e-02  -6.121 1.15e-09 ***
Budget               4.112e-10  4.582e-10   0.897 0.369625    
GenreAdventure       3.437e-01  9.379e-02   3.665 0.000255 ***
GenreAnimation       4.101e-01  9.395e-02   4.365 1.35e-05 ***
GenreBiography       9.451e-01  1.146e-01   8.248 3.21e-16 ***
GenreComedy          1.423e-01  6.692e-02   2.126 0.033611 *  
GenreCrime           7.653e-01  1.016e-01   7.533 8.03e-14 ***
GenreDocumentary     9.712e-01  4.010e-01   2.422 0.015537 *  
GenreDrama           6.709e-01  7.281e-02   9.215  < 2e-16 ***
GenreFamily         -2.743e-01  6.315e-01  -0.434 0.664037    
GenreFantasy        -1.603e-01  2.849e-01  -0.563 0.573785    
GenreHorror         -2.218e-01  1.072e-01  -2.070 0.038646 *  
GenreMusic          -1.837e+00  8.907e-01  -2.062 0.039351 *  
GenreMusical         4.313e-01  6.318e-01   0.683 0.494961    
GenreMystery         4.671e-01  2.423e-01   1.928 0.054077 .  
GenreSci-Fi          2.148e-01  4.007e-01   0.536 0.591982    
GenreThriller       -2.897e-01  5.152e-01  -0.562 0.573928    
GenreWestern         1.078e+00  8.901e-01   1.211 0.226105    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.889 on 1683 degrees of freedom
Multiple R-squared:  0.1336,    Adjusted R-squared:  0.1243 
F-statistic: 14.41 on 18 and 1683 DF,  p-value: < 2.2e-16

1.7.0.3 Logistic Regression Model for Awards

The logistic regression model predicts the likelihood of a film winning at least one award based on its Bechdel Test result, budget, and genre.

1.7.0.4 Key Findings:

  • Bechdel Test Result: The coefficient for Bechdel_test_result is -0.097, which is not statistically significant (p = 0.441). This indicates that passing the Bechdel Test does not have a meaningful impact on a film’s likelihood of winning awards.

  • Budget: The budget has a highly significant positive effect (p < 0.001). With an estimated coefficient of $9.458 \times 10^{-9}$, films with higher budgets are more likely to win awards. The coefficient looks small only because budget is measured in dollars rather than millions, so it is the change in log-odds per additional dollar; it remains a strong predictor.

  • Genres: Several genres show significant effects on award wins:

    • Animation, Biography, Comedy, Crime, and Drama all have positive and statistically significant coefficients, suggesting these genres are more likely to win awards.

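    • Documentary, Horror, Fantasy, and others do not show significant effects, indicating no clear relationship with award success. Music, Musical, and Western have standard errors above 600, which usually signals that very few films fall in those genres (quasi-complete separation), so their coefficients should not be interpreted.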

1.7.0.5 Conclusion:

Budget and certain genres are significant predictors of award wins, whereas passing the Bechdel Test does not seem to increase a film’s award potential. This implies that industry recognition through awards is more influenced by production investment and genre than by diversity indicators like the Bechdel Test.
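Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below does this for three coefficients copied from the award_model summary. Rescaling budget to steps of 10 million dollars is an illustrative choice, not something done in the original analysis.

```r
# Coefficients copied from the award_model summary
coefs <- c(Bechdel_test_result = -9.735e-02,
           Budget              =  9.458e-09,
           GenreDrama          =  1.475)

# Odds ratio for passing the Bechdel Test, and for Drama versus the baseline genre
odds_ratio_bechdel <- exp(coefs[["Bechdel_test_result"]])
odds_ratio_drama   <- exp(coefs[["GenreDrama"]])

# Odds ratio for an additional 10 million dollars of budget
odds_ratio_budget_10m <- exp(coefs[["Budget"]] * 1e7)

round(c(bechdel    = odds_ratio_bechdel,
        drama      = odds_ratio_drama,
        budget_10m = odds_ratio_budget_10m), 3)
```

Read this way, an extra 10 million dollars of budget multiplies the odds of winning at least one award by about 1.10, while the Bechdel odds ratio of about 0.91 is not distinguishable from 1 given its p-value of 0.441.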

1.7.0.6 Linear Regression Model for IMDb Ratings

The linear regression model assesses how a film’s Bechdel Test result, budget, and genre are related to IMDb ratings.

1.7.0.7 Key Findings:

  • Bechdel Test Result: The coefficient for Bechdel_test_result is -0.275, which is statistically significant (p < 0.001). This negative coefficient suggests that films passing the Bechdel Test have slightly lower IMDb ratings on average, although the effect size is small.

  • Budget: The budget coefficient is not statistically significant (p = 0.370), indicating that budget alone does not strongly influence IMDb ratings.

  • Genres: Several genres have a significant impact on IMDb ratings:

    • Adventure, Animation, Biography, Comedy, Crime, Documentary, and Drama are all positively associated with higher IMDb ratings.

    • Horror (-0.22) and Music (-1.84) have negative, significant coefficients, suggesting that films in these genres tend to have lower IMDb ratings on average; the Music estimate is large but comes with a wide standard error (0.89).

1.7.0.8 Model Fit:

The adjusted $R^2$ value of 0.124 indicates that only about 12.4% of the variability in IMDb ratings is explained by this model, suggesting other unaccounted factors may significantly influence IMDb ratings.

1.7.0.9 Conclusion:

While passing the Bechdel Test is associated with a slight decrease in IMDb ratings, the effect is minimal. Genre remains a more influential factor, with genres like Biography and Documentary positively associated with higher ratings. This outcome suggests that audience ratings on IMDb are more closely tied to genre preferences than to a film’s adherence to gender diversity metrics.

1.7.0.10 Overall Conclusion

From both models, genre is the most consistent predictor of a film’s critical success (as measured by awards and IMDb ratings), while budget matters for awards but not for IMDb ratings. Passing the Bechdel Test has no significant association with awards and a small negative association with IMDb ratings. While the Bechdel Test serves as a valuable tool for analyzing gender representation, these findings suggest that its impact on industry recognition and audience ratings is limited. Future analyses could explore other diversity metrics or interactions to provide a more nuanced understanding of representation’s role in film success.

1.7.0.11 Predictive Model for Film Success (Revenue, Awards, IMDb Ratings)

In this section we build a predictive model that estimates a film’s success in terms of revenue from its Bechdel Test result, budget, genre, and director gender.

# Linear regression for Revenue (continuous outcome)
revenue_model <- lm(Revenue ~ Bechdel_test_result + Budget + Genre + Gender, data = df_unique)
summary(revenue_model)

Call:
lm(formula = Revenue ~ Bechdel_test_result + Budget + Genre + 
    Gender, data = df_unique)

Residuals:
       Min         1Q     Median         3Q        Max 
-603979092  -91116228  -31182619   25947503 2921873476 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -2.316e+06  3.277e+07  -0.071   0.9437    
Bechdel_test_result -6.545e+06  1.172e+07  -0.559   0.5765    
Budget               3.043e+00  1.180e-01  25.789  < 2e-16 ***
GenreAdventure       9.431e+07  2.411e+07   3.911 9.55e-05 ***
GenreAnimation       5.082e+07  2.417e+07   2.102   0.0357 *  
GenreBiography      -7.967e+06  2.954e+07  -0.270   0.7874    
GenreComedy          1.129e+07  1.724e+07   0.655   0.5128    
GenreCrime           1.023e+07  2.612e+07   0.392   0.6952    
GenreDocumentary    -3.485e+07  1.032e+08  -0.338   0.7356    
GenreDrama           9.060e+05  1.874e+07   0.048   0.9614    
GenreFamily         -4.630e+07  1.626e+08  -0.285   0.7759    
GenreFantasy        -7.130e+07  7.350e+07  -0.970   0.3322    
GenreHorror          5.562e+07  2.759e+07   2.016   0.0440 *  
GenreMusic          -1.510e+08  2.299e+08  -0.657   0.5113    
GenreMusical         7.051e+08  1.624e+08   4.341 1.50e-05 ***
GenreMystery         2.921e+07  6.230e+07   0.469   0.6393    
GenreSci-Fi         -7.022e+07  1.030e+08  -0.682   0.4955    
GenreThriller       -1.687e+08  1.326e+08  -1.273   0.2033    
GenreWestern        -1.841e+08  2.288e+08  -0.805   0.4211    
Gender1              2.480e+07  2.317e+07   1.070   0.2847    
Gender3              7.814e+07  3.221e+07   2.426   0.0154 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 228500000 on 1681 degrees of freedom
Multiple R-squared:  0.3751,    Adjusted R-squared:  0.3677 
F-statistic: 50.45 on 20 and 1681 DF,  p-value: < 2.2e-16

1.7.0.12 Key Findings:

As in the previous models, budget and certain genres significantly influence revenue. Budget has a strong positive effect, with a coefficient estimate of 3.04 (p < 2e-16): each additional dollar of budget is associated with about 3.04 dollars of additional revenue, holding genre, director gender, and Bechdel result constant.

Adventure, Animation, Horror, and Musical also have significant positive coefficients. Adventure and Musical show the strongest evidence (p < 0.001), while Animation and Horror are significant at the 5% level.

Bechdel Test Result and many other genres (e.g., Biography, Comedy, Drama) were not statistically significant (p > 0.05). For director gender, the male coefficient (Gender1) is not significant (p = 0.285), so male- and female-directed films do not differ significantly in revenue once budget and genre are controlled for; only the unisex category (Gender3) differs from the female baseline (p = 0.015).

1.7.0.13 Conclusion:

The linear regression analysis of revenue reveals that budget and specific genres are the most significant predictors of a film’s revenue. A higher budget is strongly associated with increased revenue, highlighting the crucial role of financial investment in a film’s commercial success. Genres such as Adventure, Animation, and Musical also show significant positive relationships with revenue, indicating that films in these genres tend to perform better financially. On the other hand, the Bechdel Test result and the male/female director contrast do not significantly influence revenue once budget and genre are accounted for, suggesting that gender representation and passing the Bechdel Test may not directly affect a film’s box office performance in this dataset.

1.7.0.14 Logistic regression Model for Gender

# Logistic regression to predict the likelihood of passing the Bechdel Test based on director gender
model <- glm(Bechdel_binary ~ Gender, data = df_gender, family = binomial)
summary(model)

Call:
glm(formula = Bechdel_binary ~ Gender, family = binomial, data = df_gender)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.3643     0.2336   5.839 5.24e-09 ***
Gender1      -1.7203     0.2395  -7.183 6.80e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2202.4  on 1605  degrees of freedom
Residual deviance: 2137.4  on 1604  degrees of freedom
AIC: 2141.4

Number of Fisher Scoring iterations: 4

The logistic regression model indicates a significant relationship between director gender and the likelihood of a movie passing the Bechdel Test. The intercept (1.3643) is the log-odds of passing for the baseline group, female-directed movies; it corresponds to odds of about 3.9 to 1, or a pass rate of roughly 80%. The Gender1 coefficient (-1.7203), representing male directors, is negative and statistically significant (p < 0.001), indicating that movies directed by males have lower odds of passing the Bechdel Test than those directed by females. Specifically, exp(-1.7203) ≈ 0.18, so male-directed films have approximately 82% lower odds of passing, highlighting a meaningful gender-based difference in female representation in film.
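With a single binary predictor, the fitted coefficients can be recovered by hand from the pass/fail counts reported in the table "Bechdel Test Results by Director Gender". The sketch below does this and refits the model on the aggregated counts. The data frame `counts_demo` and the other object names are introduced here for illustration.

```r
# Counts copied from the table "Bechdel Test Results by Director Gender"
counts_demo <- data.frame(
  gender = factor(c("Female", "Male")),
  pass   = c(90, 615),
  fail   = c(23, 878)
)

# Odds of passing in each group
odds <- counts_demo$pass / counts_demo$fail

# Intercept = log-odds for the baseline (female directors)
intercept_by_hand <- log(odds[1])

# Gender coefficient = log of the odds ratio (male versus female)
slope_by_hand <- log(odds[2] / odds[1])

# Refit on aggregated counts: the response is a (successes, failures) matrix
fit_demo <- glm(cbind(pass, fail) ~ gender, family = binomial, data = counts_demo)

rbind(by_hand = c(intercept_by_hand, slope_by_hand),
      glm     = unname(coef(fit_demo)))

# Percentage reduction in the odds of passing for male-directed films
(1 - exp(slope_by_hand)) * 100
```

The hand-computed values agree with the reported estimates of 1.3643 and -1.7203, and the last line gives the reduction of roughly 82% quoted above.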

1.8 Conclusion and next steps

This study investigates the financial and critical impact of female representation in films, as assessed by Bechdel Test outcomes, focusing on correlations with revenue, awards, director’s gender and IMDb ratings. Through rigorous data collection and preprocessing, we have established a robust dataset of films enriched with attributes like budget, genre, director, and gender, along with their Bechdel Test results, enabling a nuanced examination of gender representation’s potential influence on a film’s success.

Our analyses do not support the hypothesis that passing the Bechdel Test improves revenue or critical reception. Passing shows a weak negative correlation with revenue (r ≈ -0.095) that is no longer statistically significant once budget, genre, and director gender are controlled for (p = 0.58), and it is associated with IMDb ratings about 0.27 points lower on average. It has no significant association with winning awards. Budget and genre are the dominant predictors: the revenue model explains about 37% of the variance (adjusted R-squared of 0.368), while the IMDb rating model explains about 12%. Director gender is the strongest predictor of passing the test itself: roughly 80% of female-directed films pass, compared with about 41% of male-directed films. Interactions between gender representation, genre, and budget level remain unexplored and warrant further analysis.

Moving forward, we aim to refine these predictive models to enhance their precision and to conduct deeper analyses of specific genres and budget brackets. This will help clarify how female representation might drive both economic and critical outcomes in different contexts. Visualizations of these refined trends will also be developed to highlight the broader implications of gender representation in film. Ultimately, the insights from this study are expected to offer valuable, data-driven perspectives on the role of diversity in the film industry, shedding light on both social and commercial aspects of gender representation.